HDDS-14921. Improve space accounting in SCM with In-Flight container allocation tracking. #10000
ashishkumar50 wants to merge 7 commits into apache:master
Conversation
rakeshadr
left a comment
Thanks @ashishkumar50 for providing the patch. Added a few comments, please take care.
```java
if (!alreadyOnDn && getContainerManager() instanceof ContainerManagerImpl) {
  ((ContainerManagerImpl) getContainerManager())
      .getPendingContainerTracker()
      .removePendingAllocation(dd, id);
```

Say the DN is healthy, all containers are confirmed, and there are no new allocations → that DN's bucket never rolls even though heartbeats come every 30 seconds, right?

```
t=0      Container C1 allocated → pending recorded in tracker
t=60-120 FCR arrives from DN
         → cid = C1
         → alreadyInDn = expectedContainersInDatanode.remove(C1) = FALSE
         → !alreadyInDn = TRUE → removePendingAllocation called → rollIfNeeded fires ✓
         → C1 added to NM DN-set
```

How about rolling on every processHeartbeat, i.e. every 30 seconds, regardless of container state changes?
```java
// Cleanup empty buckets to prevent memory leak
if (bucket.isEmpty()) {
```

This can potentially hit a concurrency issue. Say two threads enter this block:

Thread-1 (removePendingAllocation): bucket.isEmpty() returns true.
Thread-2 (recordPendingAllocationForDatanode): computeIfAbsent(uuid) returns the same bucket reference (the key still exists) and calls bucket.add(containerID), so the bucket is now non-empty.
Thread-1: datanodeBuckets.remove(uuid, bucket) then removes the non-empty bucket, and the containerID ends up in a detached bucket object, right?

I think we need to add synchronization to avoid the detached bucket object.
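One way to avoid the detached-bucket race without a separate lock is to do both the mutation and the emptiness check inside `ConcurrentHashMap.compute`/`computeIfPresent`, which run atomically per key. A minimal sketch, with illustrative names (`PendingTrackerSketch`, `Bucket`) that are not the patch's actual classes:

```java
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

class PendingTrackerSketch {
  static class Bucket {
    final Set<String> ids = ConcurrentHashMap.newKeySet();
  }

  final Map<UUID, Bucket> datanodeBuckets = new ConcurrentHashMap<>();

  // Create-if-absent and add happen under the map's per-key lock, so a
  // concurrent cleanup cannot detach the bucket between the two steps.
  void record(UUID dn, String containerId) {
    datanodeBuckets.compute(dn, (k, bucket) -> {
      if (bucket == null) {
        bucket = new Bucket();
      }
      bucket.ids.add(containerId);
      return bucket;
    });
  }

  // Remove the ID and drop the bucket only if it is still empty inside the
  // same atomic compute; returning null removes the mapping.
  void remove(UUID dn, String containerId) {
    datanodeBuckets.computeIfPresent(dn, (k, bucket) -> {
      bucket.ids.remove(containerId);
      return bucket.ids.isEmpty() ? null : bucket;
    });
  }
}
```

Because the emptiness check and the map removal happen in one atomic step, a concurrent `record` either sees the bucket before removal (and keeps it alive by making it non-empty) or runs after removal (and creates a fresh bucket).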
```java
@Test
public void testRemoveFromBothWindows() {
```

Do we have a test scenario covering rollover? That is, the two-window rolling behavior (a container in previousWindow ages out after 2× the interval): say, add C1 to currentWindow, roll so C1 moves to previousWindow, then wait for the next rollover.
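The scenario above can be made deterministic by injecting the clock instead of sleeping. A hypothetical test double (class and method names are illustrative, not the patch's actual API):

```java
import java.util.HashSet;
import java.util.Set;

// Test double for a two-window bucket with an injectable clock, so a unit
// test can drive rollover without real waiting.
class TwoWindowBucketSketch {
  private Set<String> currentWindow = new HashSet<>();
  private Set<String> previousWindow = new HashSet<>();
  private long lastRollMs;
  private final long rollIntervalMs;

  TwoWindowBucketSketch(long rollIntervalMs, long nowMs) {
    this.rollIntervalMs = rollIntervalMs;
    this.lastRollMs = nowMs;
  }

  synchronized void add(String id, long nowMs) {
    rollIfNeeded(nowMs);
    currentWindow.add(id);
  }

  synchronized int count(long nowMs) {
    rollIfNeeded(nowMs);
    return currentWindow.size() + previousWindow.size();
  }

  private void rollIfNeeded(long nowMs) {
    while (nowMs - lastRollMs >= rollIntervalMs) {
      previousWindow = currentWindow;   // current → previous
      currentWindow = new HashSet<>();  // start a fresh current window
      lastRollMs += rollIntervalMs;
    }
  }
}
```

With a 100 ms interval, C1 added at t=0 is still counted at t=150 (it has moved to previousWindow) and gone at t=250, which is exactly the 2× aging the reviewer asks to cover.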
sumitagrawl
left a comment
@ashishkumar50 Thanks for working on this, I have a few review comments.
```java
)
private int transactionToDNsCommitMapLimit = 5000000;

@Config(key = "hdds.scm.container.pending-allocation.roll-interval",
```

We can name the config `hdds.scm.container.pending.allocation.roll.interval`.
```java
 * @param pipeline The pipeline where container is allocated
 * @param containerID The container being allocated
 */
public void recordPendingAllocation(Pipeline pipeline, ContainerID containerID) {
```

This needs to be part of SCMNodeManager, more specifically SCMNodeStat. Reasons:
- we need to handle cleanup in handlers like the stale node / dead node handler
- we may need to report this when reporting available space on the DN to the CLI
- it will be used by the pipeline allocation policy, where the container manager does not come into play

It is datanode space; we are just trying to identify already-allocated space. And it needs to be part of the committed space at SCM when reporting to the CLI or in other breakdowns.
```java
final ContainerInfo container;
try {
  // Check if container is already known to this DN before adding
  boolean alreadyOnDn = false;
```

Do we really need to check whether the container exists, or can we just remove it if it exists, in a single call?

Agreed. Instead of checking and then removing, we can just remove from the pending list directly. The result is the same, and we avoid one extra op.

Just checked: this inherently copies the containers into a TreeSet via `getExisting(id).copyContainers()`; we should avoid that at all cost.
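The single-call variant the reviewers describe falls out of `Set.remove` itself, which already reports whether the element was present. A minimal sketch with illustrative names (`RemoveSketch`, `confirm` are not the patch's actual API):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class RemoveSketch {
  final Set<String> pending = ConcurrentHashMap.newKeySet();

  // Set.remove returns true only if the element was present, so the separate
  // existence check (and the costly copyContainers() it triggers) is unneeded.
  boolean confirm(String containerId) {
    return pending.remove(containerId);
  }
}
```

The boolean result plays the role of `!alreadyOnDn`: true on the first confirmation, false on every subsequent report.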
```java
processContainerReplica(dd, container, replicaProto, publisher, detailsForLogging);

// Remove from pending tracker when container is added to DN
if (!alreadyOnDn && getContainerManager() instanceof ContainerManagerImpl) {
```

Please check whether the node report is also sent in the ICR; the reason is that the node information should be updated at the same time as the ICR.
```java
// (1*5GB) + (2*5GB) = 15GB → actually 3 containers
long totalCapacity = 0L;
long effectiveAllocatableSpace = 0L;
for (StorageReportProto report : storageReports) {
```

Instead of calculating all the available space and then subtracting, we can do it progressively, like:

```
required = pending + newAllocation
for each report:
  required = required - volumeUsage (rounded off)
  if (required <= 0)
    return true
```

But we also need to reserve: we can add first and then check, and if there is not enough space, remove the containerId.

OR, the other way: when handling the DN storage report, the total consolidated value can also be kept in memory to avoid looping on every call.
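The progressive check sketched above is a short early-exit loop. A runnable version, with hypothetical names (`SpaceCheckSketch`, `hasRoom`, a plain array of per-volume free bytes standing in for the storage reports):

```java
class SpaceCheckSketch {
  // Returns true as soon as the cumulative free space across volumes covers
  // pending reservations plus the new allocation; remaining volumes are
  // never scanned.
  static boolean hasRoom(long[] volumeFreeBytes, long pendingBytes, long newAllocationBytes) {
    long required = pendingBytes + newAllocationBytes;
    for (long free : volumeFreeBytes) {
      required -= free;
      if (required <= 0) {
        return true;
      }
    }
    return false;
  }
}
```

For example, three volumes with 5 GB free each cover 4 GB pending plus one 5 GB allocation after two volumes, so the third is never visited.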
```java
bucket.rollIfNeeded();

// Each pending container assumes max size
return (long) bucket.getCount() * maxContainerSize;
```

This is a costly operation: it first combines the two window sets and then counts the result.
```java
 * Contains current and previous window sets, plus last roll timestamp.
 */
private static class TwoWindowBucket {
  private Set<ContainerID> currentWindow = ConcurrentHashMap.newKeySet();
```

If we use synchronized on all methods, the sets need not be thread-safe.
```java
"After 2x this interval, allocations that haven't been confirmed via " +
"container reports will automatically age out. Default is 10 minutes."
)
private Duration pendingContainerAllocationRollInterval = Duration.ofMinutes(10);
```

Is the rolling period 5 min or 10 min? That is, does the previous bucket stay around for an additional 5 min to capture the in-progress container list?
```java
// Remove from pending tracker when container is added to DN
// This container was just confirmed for the first time on this DN
// No need to remove on subsequent reports (it's already been removed)
if (container != null && getContainerManager() instanceof ContainerManagerImpl) {
```

The code handling differs between ICR and FCR; it can be the same for both.
```java
  return true;
} catch (Exception e) {
  LOG.warn("Error checking space for pipeline {}", pipeline.getId(), e);
  return true;
```

If we are not sure whether we can create a container here, should we still choose this pipeline? Instead of making this generic, we can specify what to do for each exception we might see.
```java
// Remove from pending tracker when container is added to DN
// This container was just confirmed for the first time on this DN
// No need to remove on subsequent reports (it's already been removed)
if (container != null && getContainerManager() instanceof ContainerManagerImpl) {
```

Why not just add this to the ContainerManager interface? We could avoid these casts. Is this because Recon uses the same code path and we don't want it to do this? For Recon we can just make it a no-op.
```java
}

@Override
public org.apache.hadoop.hdds.scm.node.DatanodeInfo getDatanodeInfo(DatanodeDetails datanodeDetails) {
```

Nit: we can import DatanodeInfo instead of using the fully qualified name.
```java
 * Get count of all pending containers (union).
 */
synchronized int getCount() {
  return getAllPending().size();
```

Can we just return the sum of the two windows' sizes instead of adding them to a separate set only to count it here?
```java
 * Roll the windows: previous = current, current = empty.
 * Called when current time exceeds lastRollTime + rollIntervalMs.
 */
synchronized void rollIfNeeded() {
```

Pending allocations can persist beyond 2× the roll interval after long idle periods, because rollIfNeeded() only rolls once. A single roll doesn't clear entries older than two windows, which can incorrectly block new allocations.
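One fix for the single-roll issue is to loop until `lastRollTime` is within one interval of now, so both windows end up empty after a long idle gap. A minimal sketch with an explicit `now` parameter for testability; names mirror the patch's description but are illustrative:

```java
import java.util.HashSet;
import java.util.Set;

class RollSketch {
  Set<String> currentWindow = new HashSet<>();
  Set<String> previousWindow = new HashSet<>();
  long lastRollTime;
  final long rollIntervalMs;

  RollSketch(long rollIntervalMs, long nowMs) {
    this.rollIntervalMs = rollIntervalMs;
    this.lastRollTime = nowMs;
  }

  // Loop instead of a single roll: after an idle period longer than two
  // intervals, every stale entry has aged out of both windows.
  synchronized void rollIfNeeded(long nowMs) {
    while (nowMs - lastRollTime >= rollIntervalMs) {
      previousWindow = currentWindow;
      currentWindow = new HashSet<>();
      lastRollTime += rollIntervalMs;
    }
  }
}
```

With a single `if` instead of the `while`, an entry sitting in currentWindow through a long idle stretch would survive one roll into previousWindow and keep counting against the DN's space.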
```java
long effectiveRemaining = effectiveAllocatableSpace - pendingAllocations;

// Check if there's enough space for a new container
if (effectiveRemaining < maxContainerSize) {
```

This makes the allocation a little aggressive, right? Even if we have just 5GB left we allocate it. Should we leave some buffer when allocating a container?
```java
bucket.rollIfNeeded();

boolean added = bucket.add(containerID);
LOG.info("Recorded pending container {} on DataNode {}. Added={}, Total pending={}",
```

We can change all the LOGs to debug; this will get too noisy in the SCM log with an entry for every allocation and removal.
```java
bucket.rollIfNeeded();

boolean removed = bucket.remove(containerID);
LOG.info("Removed pending container {} from DataNode {}. Removed={}, Remaining={}",
```

Same as above for all the LOGs in this class.
What changes were proposed in this pull request?
Maintain space accounting during container allocation in SCM. A more detailed description is in the Jira.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-14921
How was this patch tested?
UT and IT. (More test scenarios TBD.)